Venice is drowning - final report

Abstract

The main objective of this project is to analyze tide measurements from the Venice lagoon in order to build and test predictive models, evaluated on forecast horizons ranging from one hour up to one week.

For this purpose, three models were built, two linear and one based on machine learning:

  • an ARIMA (AutoRegressive Integrated Moving Average) model;
  • a UCM (Unobserved Component Models) model;
  • an LSTM (Long Short-Term Memory) model.

Datasets

Two datasets feed the project pipeline. The first contains tide-level measurements (in cm, relative to a reference level) recorded by a sensor in the Venice lagoon between 1983 and 2018; the second holds meteorological variables, namely rainfall (mm), wind direction at 10 meters (degrees), and wind speed at 10 meters (m/s), for the period between 2000 and 2019.

The tide-level dataset was assembled from the yearly historical datasets made public by the city of Venice, in particular by the Centro Previsioni e Segnalazioni Maree. The meteorological data were instead provided, on request, by ARPA Veneto. All preprocessing operations (parsing, inspection, and the final union of the two datasets) are available in the following scripts/html files, produced in this order:

  • parsing_tides_data is the script used to assemble the complete tide dataset, importing and unifying all the available yearly datasets;
  • inspection contains a series of preliminary inspections of the aforementioned data;
  • preprocess_weather_data_2000_2019 contains the processing operations concerning the meteorological data;
  • parsing_tides_weather finally summarizes the operations performed to handle the missing values in the meteorological dataset and the final merge of the two files.

As a design choice, during data processing we decided to restrict the data to the period between January 2010 and December 2018.
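
The restrict-and-merge step described above can be sketched as follows. This is a stdlib-only illustration on toy records, not the actual parsing_tides_weather code, and the field layout is hypothetical:

```python
from datetime import datetime

# Toy hourly records: timestamp -> tide level (cm) and timestamp -> rainfall (mm)
tides = {
    datetime(2009, 12, 31, 23): 52,
    datetime(2010, 1, 1, 0): 55,
    datetime(2010, 1, 1, 1): 58,
}
weather = {
    datetime(2010, 1, 1, 0): 0.2,
    datetime(2010, 1, 1, 1): 0.0,
}

start, end = datetime(2010, 1, 1), datetime(2018, 12, 31, 23)

# Inner-join the two sources on the timestamp, keeping only 2010-2018
merged = [
    (ts, level, weather[ts])
    for ts, level in sorted(tides.items())
    if start <= ts <= end and ts in weather
]
```

The 2009 record falls outside the chosen window and is dropped, so only the two overlapping 2010 hours survive the join.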

Data inspection

During data processing, and before building the actual forecasting models, a series of visualizations was produced in order to thoroughly inspect several aspects of the time series.

Figure 1: Time series visualization with autocorrelation and partial autocorrelation plots


Fig. 1 shows the complete time series together with its autocorrelation and partial autocorrelation plots, while fig. 2 suggests that the tide levels are approximately normally distributed. In this case, in fact, the notions of weak and strict stationarity coincide.

Figure 2: Time series distribution


One of the checks performed during this first inspection of the historical data concerned stationarity in mean and in variance. As for the former, the series in fig. 1 indeed appears stationary in mean; to confirm this hypothesis, fig. 3 reports the output of an Augmented Dickey-Fuller test which, as can be seen, fully confirms the stationarity in mean of the tidal phenomenon.

Figure 3: Output of the Augmented Dickey-Fuller test
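
The ADF output in fig. 3 was produced in R. To illustrate the idea behind the (non-augmented) Dickey-Fuller test, the numpy sketch below regresses Δy_t on y_{t-1} and inspects the t-statistic of the slope: a strongly negative value points to stationarity (proper critical values are not computed here, so this is only a sketch of the mechanics):

```python
import numpy as np

def dickey_fuller_tstat(y):
    """t-statistic of rho in: diff(y)_t = alpha + rho * y_{t-1} + eps."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)                                   # Δy_t
    X = np.column_stack([np.ones(len(dy)), y[:-1]])   # constant + lagged level
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])               # t-stat of the y_{t-1} term

# A stationary AR(1) process should yield a strongly negative statistic
rng = np.random.default_rng(0)
ar1 = [0.0]
for _ in range(500):
    ar1.append(0.5 * ar1[-1] + rng.normal())
```

For a stationary series like this AR(1), the statistic comes out far below the usual critical values, matching the conclusion drawn from fig. 3 for the tide series.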

After verifying stationarity in mean, stationarity in variance was checked: the plot in fig. 4 clearly shows that the daily mean tide levels and their standard deviation do not follow an increasing trend but remain substantially flat.
Figure 4: Visualization of stationarity in variance
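
The check in fig. 4 can be reproduced in miniature: group an hourly series by day, compute daily means and standard deviations, and verify that neither drifts upward. A numpy-only sketch on synthetic data (one month of an M2-like oscillation plus noise; the data are illustrative, not the actual tide series):

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(24 * 30)                                  # one synthetic month, hourly
series = 30 * np.sin(2 * np.pi * t / 12.42) + rng.normal(0, 5, len(t))

daily = series.reshape(-1, 24)                          # one row per day
daily_mean = daily.mean(axis=1)
daily_std = daily.std(axis=1)

# A flat trend shows up as a near-zero slope of a linear fit over the days
slope_std = np.polyfit(np.arange(len(daily_std)), daily_std, 1)[0]
```

A slope close to zero for `daily_std` (and likewise for `daily_mean`) is what "stationary in variance" looks like in fig. 4.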


Models

As anticipated, the models fall into two areas: a purely statistical one, with linear models such as ARIMA and UCM, and a machine-learning one, through the definition of an LSTM model. Their preparation and implementation are presented below; a final results section then allows a quick comparison of the models' performance on a test set defined a priori. In that regard, it is worth highlighting the data used in each area:

  • for the linear models, the training set consists of the last six months of 2018, from July to December;
  • for the machine-learning model, given its capacity to handle more data at roughly constant computational cost, the training set covers the period between January 2010 and December 2018.

The test set, previously extracted, refers to the last week of December 2018, i.e. from 24/12/2018 23:00:00 to 31/12/2018 23:00:00.
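
A minimal stdlib sketch of the split by timestamp (the hourly grid here is illustrative, covering just the days around the cut):

```python
from datetime import datetime, timedelta

# Toy hourly grid spanning the end of 2018
start = datetime(2018, 12, 23, 0)
stamps = [start + timedelta(hours=h) for h in range(24 * 10)]

test_start = datetime(2018, 12, 24, 23)
test_end = datetime(2018, 12, 31, 23)

train = [ts for ts in stamps if ts < test_start]
test = [ts for ts in stamps if test_start <= ts <= test_end]
```

With hourly data, the one-week test window contains 169 observations (7 × 24 hours plus the inclusive endpoint).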

Figure 5: Train and test data representation


With reference to the linear models, two strategies are implemented: the former consists in integrating the meteorological variables with the lunar motion, while the latter consists in extracting the principal periodic components using oce, an R package that helps oceanographers by providing functions to read and process oceanographic data files.

Regarding the first strategy, after processing the meteorological data as previously mentioned, the lunar motion is tracked using PyEphem, an astronomy library that provides basic astronomical computations for the Python programming language. Given a date and a location on the Earth's surface, it can compute the positions of the Sun and Moon, of the planets and their moons, and of any asteroids, comets, or Earth satellites whose orbital elements the user can provide. To track the lunar motion, all we have to do is select the period of interest and the coordinates representing Venice.

Figure 6: Interactive plot representing lunar motion between 2010 and 2018
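
The report uses PyEphem for precise positions; as a dependency-free illustration of the kind of lunar regressor involved, the sketch below maps a date to a phase fraction in [0, 1) using the mean synodic month. The epoch and period are standard astronomical constants, and this rough approximation stands in for, but is not, the PyEphem computation:

```python
from datetime import datetime, timezone

SYNODIC_MONTH = 29.530588853      # mean length of a lunation, in days
NEW_MOON_EPOCH = 2451550.26       # approximate Julian day of the 2000-01-06 new moon

def julian_day(dt):
    """Julian day from a timezone-aware datetime."""
    return dt.timestamp() / 86400.0 + 2440587.5

def lunar_phase(dt):
    """Approximate phase fraction in [0, 1): 0 = new moon, 0.5 = full moon."""
    return ((julian_day(dt) - NEW_MOON_EPOCH) / SYNODIC_MONTH) % 1.0

phase = lunar_phase(datetime(2000, 1, 6, 18, 14, tzinfo=timezone.utc))
```

Evaluated hourly over 2010–2018, such a phase series would provide a crude stand-in for the lunar-motion regressor shown in fig. 6.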


The second strategy, as anticipated, concerns the principal periodic components that can be extracted from a sea-level time series in order to use them as regressors for the tide-level series. The oce package provides a function called tidem that fits a model in terms of sine and cosine components at the indicated tidal frequencies, with the amplitude and phase calculated from the resulting coefficients on the sine and cosine terms. Tidem can extract up to 69 components, but we focused on 8 of them, in particular:

  • M2, main lunar semi-diurnal with a period of ~12 hours;
  • S2, main solar semi-diurnal (~12 hours);
  • N2, lunar-elliptic semi-diurnal (~13 hours);
  • K2, lunar-solar semi-diurnal (~12 hours);
  • K1, lunar-solar diurnal (~24 hours);
  • O1, main lunar diurnal (~26 hours);
  • SA, solar annual (~24*365 hours);
  • P1, main solar diurnal (24 hours).
Figure 7: Interactive filtering plot for the extracted components
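
The harmonic method behind tidem can be illustrated in numpy: for each chosen constituent, add a sine and a cosine regressor at its frequency and solve by least squares; amplitude and phase then follow from the two coefficients. Below, a synthetic series built from an M2-like and an S2-like component (periods as listed above) is recovered exactly. This is a sketch of the technique, not the oce implementation:

```python
import numpy as np

t = np.arange(0.0, 24 * 60)                   # 60 days of hourly samples
periods = {"M2": 12.42, "S2": 12.00}          # constituent periods in hours

# Synthetic "tide": a 40 cm M2 component plus a 15 cm S2 component
y = (40 * np.cos(2 * np.pi * t / 12.42 - 0.7)
     + 15 * np.cos(2 * np.pi * t / 12.00 - 1.9))

# Design matrix: one cosine and one sine column per constituent
cols = []
for p in periods.values():
    w = 2 * np.pi / p
    cols += [np.cos(w * t), np.sin(w * t)]
X = np.column_stack(cols)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
amplitudes = {
    name: float(np.hypot(coef[2 * i], coef[2 * i + 1]))
    for i, name in enumerate(periods)
}
```

The fitted columns, evaluated over the forecast horizon, are exactly the kind of deterministic regressors that the extracted components provide to the linear models.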

ARIMA

Both linear models use the data between 01/07/2018 00:00:00 and 24/12/2018 23:00:00 as training set. This choice was driven by the need to keep the models' fitting time manageable: with more data, fitting time grew considerably. Both ARIMA and UCM are implemented in R, in particular with the forecast and KFAS packages.

As a first approach to the forecasting task we trained two ARIMA models: the first is trained using as regressors the meteorological variables together with the lunar motion, while the second uses the tidal components extracted with oce.
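
The forecast package handles the full ARIMA machinery; the autoregressive core with an exogenous regressor can be sketched in numpy as an OLS fit of y_t on its own lags and the regressor. This is an ARX illustration of the idea, not the actual R model of the report:

```python
import numpy as np

def fit_arx(y, x, p):
    """OLS fit of y_t on a constant, p lags of y, and an exogenous regressor x_t."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    rows = ([np.ones(len(y) - p)]
            + [y[p - k - 1:len(y) - k - 1] for k in range(p)]   # lagged levels
            + [x[p:]])                                          # exogenous term
    X = np.column_stack(rows)
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef  # [const, phi_1..phi_p, beta]

# Simulate y_t = 0.8 y_{t-1} + 0.5 x_t + noise and recover the coefficients
rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.8 * y[t - 1] + 0.5 * x[t] + rng.normal(scale=0.1)

coef = fit_arx(y, x, p=1)  # roughly [0, 0.8, 0.5]
```

In the report's setting, x would be a matrix of meteorological, lunar, or tidal-component regressors rather than a single simulated column.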

UCM
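
A UCM, as fitted with KFAS in R, describes the series through unobserved components estimated by the Kalman filter. As background, here is a minimal numpy Kalman filter for the simplest UCM, the local level model y_t = mu_t + eps_t, mu_t = mu_{t-1} + eta_t; the variances are assumed known here, whereas KFAS estimates them by maximum likelihood:

```python
import numpy as np

def local_level_filter(y, sigma2_eps, sigma2_eta, a0=0.0, p0=1e7):
    """Kalman filter for y_t = mu_t + eps_t, mu_t = mu_{t-1} + eta_t."""
    a, p = a0, p0                     # state mean and variance (diffuse-ish init)
    level = []
    for obs in y:
        p = p + sigma2_eta            # prediction: random-walk level
        k = p / (p + sigma2_eps)      # Kalman gain
        a = a + k * (obs - a)         # update with the new observation
        p = (1 - k) * p
        level.append(a)
    return np.array(level)

# Noisy observations of a slowly drifting level
rng = np.random.default_rng(3)
mu = np.cumsum(rng.normal(scale=0.1, size=300))
y = mu + rng.normal(scale=1.0, size=300)
level = local_level_filter(y, sigma2_eps=1.0, sigma2_eta=0.01)
```

The filtered level tracks the true underlying component far more closely than the raw observations do, which is precisely what makes the state-space decomposition useful for forecasting.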

LSTM

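
As context for this section, a standard preparation step for an LSTM on this series is reshaping the hourly data into supervised windows of shape (samples, timesteps, features). A numpy sketch (the window length and horizon are illustrative, not the report's actual choices):

```python
import numpy as np

def make_windows(series, n_in, n_out):
    """Slide over a 1-D series producing (X, y) pairs for sequence models.

    X has shape (samples, n_in, 1) -- the usual LSTM input layout --
    and y has shape (samples, n_out).
    """
    series = np.asarray(series, float)
    n = len(series) - n_in - n_out + 1
    X = np.stack([series[i:i + n_in] for i in range(n)])[..., None]
    y = np.stack([series[i + n_in:i + n_in + n_out] for i in range(n)])
    return X, y

# 24 hours of history predicting the next 6 hours, on a toy ramp series
X, y = make_windows(np.arange(100.0), n_in=24, n_out=6)
```

Each sample pairs one day of past levels with the following six hours, and longer horizons (up to the one-week test window) are obtained by enlarging n_out or by forecasting recursively.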

Dario Bertazioli, Fabrizio D’Intinosante

2020-01-22